Large-scale gene expression alterations introduced by structural variation drive morphotype diversification in Brassica oleracea

Brassica oleracea, globally cultivated for its vegetable crops, consists of very diverse morphotypes, characterized by specialized enlarged organs as harvested products. This makes B. oleracea an ideal model for studying rapid evolution and domestication. We constructed a B. oleracea pan-genome from 27 high-quality genomes representing all morphotypes and their wild relatives. We identified structural variations (SVs) among these genomes and characterized these in 704 B. oleracea accessions using graph-based genome tools. We show that SVs exert bidirectional effects on the expression of numerous genes, either suppressing through DNA methylation or promoting probably by harboring transcription factor-binding elements. The following examples illustrate the role of SVs modulating gene expression: SVs promoting BoPNY and suppressing BoCKX3 in cauliflower/broccoli, suppressing BoKAN1 and BoACS4 in cabbage and promoting BoMYBtf in ornamental kale. These results provide solid evidence for the role of SVs as dosage regulators of gene expression, driving B. oleracea domestication and diversification.


Library construction and sequencing
A total of 22 B. oleracea accessions were selected for de novo genome assembly in this study, among which 17 accessions were sequenced by Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences (IVF-CAAS, Beijing, China) and five accessions (T06, T24, T25, T26 and T27) were sequenced by Plant Breeding, Wageningen University & Research (PBR-WUR, Wageningen, the Netherlands).The for 16 out of 17 accessions using a previously described approach 1 , which were then sequenced on the Illumina HiSeq 2000 platform.
For the five PBR-WUR accessions, young leaves were collected and high molecular weight (HMW) DNA was extracted for each plant following a previously established protocol 2 .The SQK-LSK109 Ligation Sequencing kit (Oxford Nanopore Technologies; Oxford, UK) was used for library constructions according to manufacturer's instructions.Long-read sequencing data were generated using the Oxford Nanopore GridION platform and a run-time of 48 hr.In addition, genomic DNA was extracted from these young leaves using a cetyltrimethylammonium bromide (CTAB) method  Reagent Kit according to the manufacturer's instructions.After cluster generation, the DNA libraries were sequenced on Illumina NovaSeq 6000 platform and 150 bp pairedend reads were generated.

Luciferase report experiment
Primers used for luciferase report experiment:

Representative Morphotypes
To construct a pan-genome that encompasses the full range of genetic diversity in B.
oleracea, we analyzed the resequencing data of 704 globally distributed B. oleracea accessions covering all different morphotypes and their wild relatives (Supplementary Tables 1 and 2).Phylogenetic analysis using SNPs classified the 704 accessions into three main groups (Fig. 1a).Based on the phylogeny and morphotype diversity, we selected 22 representative accessions for de novo genome assembly (Table 1).These accessions were sequenced using PacBio or Nanopore long-read sequencing technologies with an average coverage of 133×, and Illumina short-read sequencing technology with an average coverage of 250× (Supplementary Tables 3-6).In addition, we generated an average coverage of 267× chromosome conformation capture (Hi-C) data for 16 IVF-CAAS accessions and an average coverage of 212× BioNano data for five WUR accessions.The long reads were de novo assembled into contigs that were further polished using both long-and short-reads (Methods).Seven genomes with relatively high heterozygosity (> 0.3%) (Supplementary Table 7) were purged to remove potential redundant homologous sequences.
A total of 53.5%-58.5% sequences in 27 (our 22 new genomes and five previously reported genomes) [5][6][7][8] B. oleracea genomes were annotated as TEs, compared to 47.1% in the sister species Brassica rapa 9 .The long terminal repeat retrotransposons (LTR-RTs) were the most abundant TEs, representing 26.1% to 31.2% of sequences in the 27 genomes, with Copia and Gypsy accounting for an average of 39.2% and 44.0% of all LTR-RTs, respectively.We further identified a range of 4,703 to 6,253 full-length LTR-RTs (fl-LTRs) in these genomes (Supplementary Table 9).Recently inserted fl-LTRs were enriched in centromeric regions (Fig. 1c).On average, 2,896 and 1,722 full-length Copia and Gypsy LTRs, respectively, were identified in these genomes.We revealed continuous expansion of Copia and Gypsy in all these genomes since four MYA (Fig. 1d).The number of Copia insertions has always been higher than that of Gypsy, with a recent (< 0.20 MYA) increase of Copia insertions being almost twice that of Gypsy.
Additionally, Copia TEs were clustered into more and larger groups than Gypsy based on sequence similarity (Fig. 1d).Together, these results indicate that Copia was under stronger expansion than Gypsy.

Supplementary Note 3: Homoeologous Gene Retention Variation
We constructed an orthologous pan-genome comprising the 27 B. oleracea genomes.
These proportions were much higher than those for the dispensable and private genes (42% and 18%, respectively) (Supplementary Fig. 3d).Meanwhile, both the length and the number of CDS in the core and softcore genes were significantly higher than those in the dispensable and private genes (Supplementary Figs.3e and 3f).Besides, the Ka/Ks values were much higher in the dispensable/private genes than those of the core/softcore genes (Supplementary Fig. 3g), indicating higher conservation of function of the core/softcore genes and faster evolution of the dispensable/private genes.
We further investigated the copy-number variation of homoeologs among B. oleracea morphotypes.On average, ~60.25% of all genes had at least two homoeologs in the 27 genomes, similar to previous reports (Liu et al., 2014; Wang et al., 2011).We classified the syntenic genes into different groups according to the copy number of homoeologous genes (one, two or three).We found that the percentage of core genes from the category retaining three-copy homoeologs (51.5%) was considerably lower than the percentage from the categories of two-copy (61.2%) or single-copy (65.7%) homoeologs (Fig. 2h).Interestingly, we found many three-copy homoeologs that lost one or more copies in specific morphotypes.From the 3,744 three-copy homoeologs in wild B. oleracea T10, 394 had specifically lost at least one copy in cabbage.These 394 genes were enriched in functions of cytoskeleton organization and actin binding.
Another 543 of these wild B. oleracea three-copy homoeologs had lost at least one copy specifically in cauliflower and broccoli, and were enriched in functions of sulfate transport and metal ion binding (Supplementary Tables 10 and 11).The differential loss of homoeologs could contribute to the evolution of B. oleracea morphotypes through variation in gene copy dosage, which is associated with expression dosage.
Fig. 12a).Only three out of 292 (1%) cabbage accessions contained the insertion, while this insertion was present in 232 out of 314 accessions (74%) in all other accessions (Supplementary Fig. 12b).Expression of BoACS4 in cabbage accessions lacking the insertion was significantly lower (P value = 1.90×10 -14 ) than in control accessions with the insertion (Supplementary Fig. 12c).In conclusion, the presence genotypes of the two SVs increased expression of their SV-genes BoKAN1 and BoACS4, which are under strong negative selection in cabbage.
Supplementary Table 3 17 and five accessions were grown in the greenhouse of IVF-CAAS in 2020 and in the Unifarm of WUR in 2019, respectively.For the 17 IVF-CAAS accessions, Illumina libraries were constructed using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, CA, USA).Oxford Nanopore libraries were constructed using the SQK-LSK109 Ligation Sequencing Kit.PacBio SMRT libraries were constructed based on the protocol of SMRT bell library construction (https://www.pacb.com/support/documentation/).Sequencing of these constructed libraries were performed on the Illumina HiSeq 2000 platform, and PacBio Sequel or Oxford Nanopore MinION or PromethION platforms.Hi-C libraries were constructed

Supplementary Fig. 2 .Supplementary Fig. 6 .
TAD prediction in ten chromosomes of B. rapa genome (A03 accession).The heatmap shows the TAD prediction on ten chromosomes, in which the region colored in dark red denotes a TAD structure.The line charts below the heatmap shows the density of Copia and Gypsy LTRs within 500 kb sliding windows with 50 kb step size.Examples of manual validation of SVs with lengths over 8 kb.(a) A randomly selected inversion in the T08 genome.The Hi-C reads of T08 were mapped to T10 genome at 5 kb resolution.The left graph shows the inversion region and the right graph shows the region with inversion being reversed.(b) A randomly selected inversion in T04 genome.The Hi-C reads of T04 were mapped to the T10 genome at 5 kb resolution.The left graph shows the inversion region and the right graph shows the region with inversion being reversed.(c) A randomly selected deletion in the T04 genome.The left panel: the Hi-C reads of T04 were mapped to T10 genome.The right panel: the Hi-C reads of T10 mapped to T04 genome. 3

Supplementary Table 18). Briefly
extracted from fresh tissue of the five morphotypes using the Bionano Prep™ Plant Tissue DNA Isolation Kit.The Direct Label and Stain (DLS) technology, together with Bionano Saphyr platform, were used for generation of optical mapping data.DLS labeling was performed with 750 ng DNA using the Direct Labeling and . Illumina libraries with ~450 bp and ~600 bp insertion sizes were constructed and the resulting libraries were sequenced on Illumina Hiseq 2500 (~450 bp libraries) and X10 (~600 bp libraries) platforms.Moreover, high Molecular Weight plant DNA was using Illumina 8 bp dual index primers.The constructed WGBS libraries were then analyzed by Agilent 2100 Bioanalyzer.The clustering of the index-coded samples was performed on a cBot Cluster Generation System using Illumina Novaseq 6000 S4 The first group included accessions of wild B. oleracea and different kales.The distal cluster of the second group was composed of broccoli and cauliflower with separated branches for Chinese kale, Tronchuda, and kohlrabi (hereafter referred to as 'AIL', abbreviation for arrested inflorescence lineage and closely related morphotypes).The third group consisted of the leafy head cabbage accessions, plus ornamental kale and Brussels sprouts (hereafter referred to as 'LHL', abbreviation for leafy head lineage and closely related morphotypes).

.
Statistics of the long-read sequencing data for B. oleracea genomes.

Table 4 .
Summary of Illumina short-reads sequencing data for B. oleracea genomes.

Table 8 .
BUSCO evaluation scores for the assembled genomes of B. oleracea.